Converting long-format dataframes to wide-format

The purpose of this notebook is to demonstrate the conversion of long-format data into wide-format data. Long-format data contains one row per available alternative per choice situation. In contrast, wide-format data contains one row per choice situation. PyLogit and some other software packages (e.g. mlogit in R) use long-format data, whereas others, such as statsmodels in Python or Python BIOGEME, use wide-format data.
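To make the distinction concrete, here is a minimal plain-Python sketch of the two layouts. The column names ("obs_id", "alt", "cost") and all values are invented for illustration, and the helper "to_wide" is hypothetical; it is not how PyLogit performs the conversion.

```python
# Toy long-format data: one row per available alternative per choice situation.
# All names and values here are made up for illustration.
long_rows = [
    {"obs_id": 1, "alt": "car", "choice": 1, "cost": 10},
    {"obs_id": 1, "alt": "bus", "choice": 0, "cost": 25},
    {"obs_id": 2, "alt": "car", "choice": 0, "cost": 11},
    {"obs_id": 2, "alt": "bus", "choice": 1, "cost": 26},
]


def to_wide(rows):
    """Collapse long-format rows into one record per choice situation."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row["obs_id"], {"obs_id": row["obs_id"]})
        # Each alternative-specific value becomes its own suffixed column.
        record["cost_" + row["alt"]] = row["cost"]
        # The chosen alternative is stored once per choice situation.
        if row["choice"] == 1:
            record["choice"] = row["alt"]
    return list(wide.values())


wide_rows = to_wide(long_rows)
for record in wide_rows:
    print(record)
```

Four long-format rows become two wide-format records, one per choice situation, each with one cost column per alternative.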

Because different software packages have different data format requirements, it is useful to be able to convert one's data from one format to another. Other PyLogit example notebooks (such as the "Main PyLogit Example") demonstrate how to take data from wide-format and convert it into long-format. This notebook will demonstrate the reverse process: taking data from long-format and converting it into wide-format.

The dataset being used in this example is the "Travel Mode Choice" dataset from Greene and Hensher. It is described on the statsmodels website, and their description is reproduced below in full.

    The data, collected as part of a 1987 intercity mode choice study, are a sub-sample of 210 non-business
    trips between Sydney, Canberra and Melbourne in which the traveler chooses a mode from four alternatives
    (plane, car, bus and train). The sample, 840 observations, is choice based with over-sampling of the
    less popular modes (plane, train and bus) and under-sampling of the more popular mode, car. The level of
    service data was derived from highway and transport networks in Sydney, Melbourne, non-metropolitan N.S.W.
    and Victoria, including the Australian Capital Territory.

    Number of observations: 840 Observations On 4 Modes for 210 Individuals.
    Number of variables: 8
    Variable name definitions::

        individual = 1 to 210
        mode =
            1 - air
            2 - train
            3 - bus
            4 - car
        choice =
            0 - no
            1 - yes
        ttme = terminal waiting time for plane, train and bus (minutes); 0
               for car.
        invc = in vehicle cost for all stages (dollars).
        invt = travel time (in-vehicle time) for all stages (minutes).
        gc = generalized cost measure:invc+(invt*value of travel time savings)
            (dollars).
        hinc = household income ($1000s).
        psize = traveling group size in mode chosen (number).


    Source

    Greene, W.H. and D. Hensher (1997) Multinomial logit and discrete choice models in Greene, W. H. (1997)
    LIMDEP version 7.0 user’s manual revised, Plainview, New York econometric software, Inc. Download from
    on-line complements to Greene, W.H. (2011) Econometric Analysis, Prentice Hall, 7th Edition (data table
    F18-2) http://people.stern.nyu.edu/wgreene/Text/Edition7/TableF18-2.csv



In [1]:

    
# To access the Travel Mode Choice data
import statsmodels.datasets

# To perform the dataset conversion
import pylogit as pl

Load the needed dataset



In [3]:

    
# Access the dataset
mode_data = statsmodels.datasets.modechoice.load_pandas()
# Get a pandas dataframe of the mode choice data
long_df = mode_data["data"]
# Look at the dataframe to ensure that it loaded correctly
long_df.head()

Create the needed variables for the conversion function.

The PyLogit function that converts long-format data to wide-format data is "convert_long_to_wide", accessible as "pl.convert_long_to_wide". The function's docstring contains all of the information necessary to perform the conversion, and readers can consult it at their leisure. For now, we will simply create the objects needed as arguments to the function.

In particular, we will need the following 7 objects:

ind_vars
alt_specific_vars
subset_specific_vars
obs_id_col
alt_id_col
choice_col
alt_name_dict

The cells below will show exactly what these objects are.



In [10]:

    
# ind_vars is a list of strings denoting the column
# headings of data that varies across choice situations,
# but not across alternatives. In our data, this is
# the household income and party size.
individual_specific_variables = ["hinc", "psize"]

# alt_specific_vars is a list of strings denoting the
# column headings of data that vary not only across
# choice situations but also across all alternatives.
# These are columns such as the "level of service"
# variables.
alternative_specific_variables = ["invc", "invt", "gc"]

# subset_specific_vars is a dictionary. Each key is a
# string that denotes a variable that is subset specific.
# Each value is a list of alternative ids, over which the
# variable actually varies. Note that subset specific
# variables vary across choice situations and across some
# (but not all) alternatives. This is most common when
# using variables that are not meaningfully defined for
# all alternatives. An example of this in our dataset is
# terminal time ("ttme"). This variable is not meaningfully
# defined for the "car" alternative, so it is always zero
# there. Note that 4 is the id of the "car" alternative,
# which is why it is omitted from the list below.
subset_specific_variables = {"ttme": [1, 2, 3]}

# obs_id_col is the column denoting the id of the choice
# situation. If one were using a panel dataset, with multiple
# choice situations per unit of observation, the column
# denoting the unit of observation would be listed in
# ind_vars (i.e. with the individual specific variables)
observation_id_column = "individual"

# alt_id_col is the column denoting the id of the alternative
# corresponding to a given row.
alternative_id_column = "mode"

# choice_col is the column denoting whether the alternative
# on a given row was chosen in the corresponding choice situation
choice_column = "choice"

# Lastly, alt_name_dict is not necessary. However, it is useful.
# It records the names corresponding to each alternative, if there
# are any, and allows for the creation of meaningful column names
# in the wide-format data (such as when creating the columns
# denoting the available alternatives in each choice situation).
# The keys of alt_name_dict are the unique alternative ids, and
# the values are the names of each alternative.
alternative_name_dict = {1: "air",
                         2: "train",
                         3: "bus",
                         4: "car"}
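One way to sanity-check the split between individual specific and alternative-varying columns is to test whether each column is constant within a choice situation. The sketch below does this in plain Python on the first five rows of the long-format data (as returned by long_df.head()); the helper "constant_within_obs" is a hypothetical name, not part of PyLogit. Note that this check cannot tell alternative specific variables apart from subset specific ones such as "ttme"; that distinction still relies on knowing what the variable means.

```python
from collections import defaultdict

# First five long-format rows of the mode choice data.
columns = ["individual", "mode", "choice", "ttme", "invc", "invt", "gc", "hinc", "psize"]
rows = [
    (1, 1, 0, 69, 59, 100, 70, 35, 1),
    (1, 2, 0, 34, 31, 372, 71, 35, 1),
    (1, 3, 0, 35, 25, 417, 70, 35, 1),
    (1, 4, 1, 0, 10, 180, 30, 35, 1),
    (2, 1, 0, 64, 58, 68, 68, 30, 2),
]


def constant_within_obs(col, obs_col="individual"):
    """Return True if `col` takes a single value within every choice situation."""
    obs_idx, col_idx = columns.index(obs_col), columns.index(col)
    values_by_obs = defaultdict(set)
    for row in rows:
        values_by_obs[row[obs_idx]].add(row[col_idx])
    return all(len(values) == 1 for values in values_by_obs.values())


candidates = ["ttme", "invc", "invt", "gc", "hinc", "psize"]
individual_specific = [c for c in candidates if constant_within_obs(c)]
varies_by_alternative = [c for c in candidates if not constant_within_obs(c)]

print(individual_specific)
print(varies_by_alternative)
```

On these rows, only "hinc" and "psize" are constant within a choice situation, which agrees with the lists defined above. With pandas, the same idea could be applied to the full dataframe by grouping on the observation id column.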

Create the wide-format dataframe



In [12]:

    
# Finally, we can create the wide format dataframe
wide_df = pl.convert_long_to_wide(long_df,
                                  individual_specific_variables,
                                  alternative_specific_variables,
                                  subset_specific_variables,
                                  observation_id_column,
                                  alternative_id_column,
                                  choice_column,
                                  alternative_name_dict)

# Let's look at the created dataframe, transposed for easy viewing
wide_df.head().T









    Out[12]:

                            0    1    2    3    4
    individual              1    2    3    4    5
    choice                  4    4    4    4    4
    availability_air        1    1    1    1    1
    availability_train      1    1    1    1    1
    availability_bus        1    1    1    1    1
    availability_car        1    1    1    1    1
    hinc                   35   30   40   70   45
    psize                   1    2    1    3    2
    invc_air               59   58  115   49   60
    invc_train             31   31   98   26   32
    invc_bus               25   25   53   21   26
    invc_car               10   11   23    5    8
    invt_air              100   68  125   68  144
    invt_train            372  354  892  354  404
    invt_bus              417  399  882  399  449
    invt_car              180  255  720  180  600
    gc_air                 70   68  129   59   82
    gc_train               71   84  195   79   93
    gc_bus                 70   85  149   81   94
    gc_car                 30   50  101   32   99
    ttme_air               69   64   69   64   64
    ttme_train             34   44   34   44   44
    ttme_bus               35   53   35   53   53

As we can see above, PyLogit does a few things automatically. First, using the names provided in alt_name_dict, it adds suffixes to the alternative specific variables and the subset specific variables. These suffixes record which alternative a given column of data refers to. Second, when dealing with subset specific variables, PyLogit only creates columns for the alternatives over which the variable actually varies (hence there is no ttme_car column). Lastly, PyLogit automatically creates columns that denote the availability of each alternative in each choice situation. These columns are suffixed to denote the alternatives that they correspond to, and they are inferred automatically from the rows present in the long-format data.
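The availability inference described above can be sketched in plain Python as follows. This is only an illustration of the idea, not PyLogit's implementation: the helper "infer_availability" and the (observation id, alternative id) pairs are hypothetical, and observation 2 is deliberately given no row for the "air" alternative.

```python
alt_name_dict = {1: "air", 2: "train", 3: "bus", 4: "car"}

# Hypothetical long-format (observation id, alternative id) pairs.
# Observation 2 has no row for alternative 1 (air).
long_pairs = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4)]


def infer_availability(pairs, names):
    """Build one record per observation with a 0/1 availability flag per alternative."""
    present = {}
    for obs_id, alt_id in pairs:
        present.setdefault(obs_id, set()).add(alt_id)

    wide = []
    for obs_id in sorted(present):
        record = {"obs_id": obs_id}
        for alt_id, name in names.items():
            # An alternative is available if and only if a row exists for it.
            record["availability_" + name] = int(alt_id in present[obs_id])
        wide.append(record)
    return wide


availability = infer_availability(long_pairs, alt_name_dict)
for record in availability:
    print(record)
```

In the mode choice data every individual has a row for all four modes, which is why every availability column in the output above equals 1.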

The conversion function also accepts a "null_value" keyword argument. This is useful when one has alternative specific variables and not all alternatives are available in all choice situations. In that case, the wide-format columns for an unavailable alternative have no data to draw on, and one may want to specify the value used to fill them, such as null or -999. The "null_value" keyword argument allows one to do this.
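The effect of such a fill value can be illustrated with a small plain-Python sketch. The helper "widen" and the rows below are hypothetical and only mimic the idea; in PyLogit itself one would instead pass the "null_value" keyword argument to pl.convert_long_to_wide.

```python
alt_names = {1: "air", 2: "train", 3: "bus", 4: "car"}

# Hypothetical long-format rows: (observation id, alternative id, invc).
# Observation 2 has no "air" row, i.e. air is unavailable in that choice situation.
long_rows = [
    (1, 1, 59), (1, 2, 31), (1, 3, 25), (1, 4, 10),
    (2, 2, 31), (2, 3, 25), (2, 4, 11),
]


def widen(rows, var_name, names, null_value):
    """Spread one alternative specific variable into suffixed wide columns."""
    wide = {}
    for obs_id, alt_id, value in rows:
        record = wide.setdefault(obs_id, {})
        record[var_name + "_" + names[alt_id]] = value
    # Fill the columns of unavailable alternatives with the chosen null value.
    for record in wide.values():
        for name in names.values():
            record.setdefault(var_name + "_" + name, null_value)
    return wide


wide = widen(long_rows, "invc", alt_names, null_value=-999)
print(wide[1])
print(wide[2])
```

A sentinel such as -999 keeps the column numeric, but downstream code must then take care never to treat it as a real cost; a missing-value marker avoids that risk at the price of needing explicit handling of missing data.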

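For reference, the first five rows of the long-format dataframe (the output of long_df.head() above) are:

       individual  mode  choice  ttme  invc  invt  gc  hinc  psize
    0           1     1       0    69    59   100  70    35      1
    1           1     2       0    34    31   372  71    35      1
    2           1     3       0    35    25   417  70    35      1
    3           1     4       1     0    10   180  30    35      1
    4           2     1       0    64    58    68  68    30      2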
